Exploring Simpson’s Paradox Within the Penguin Dataset

Nice things

I learn new Quarto things…
Author

me

Affiliation

University

Published

January 28, 2025

Keywords

Quarto, Paradox, Data Analysis

This Quarto document serves as a practical illustration of the concepts covered in the Productive R Workflow online course. It is designed primarily for educational purposes, so the focus is on demonstrating Quarto techniques rather than on scientific rigor.

1 Introduction

This document offers a straightforward analysis of the well-known penguin dataset. It is designed to complement the Productive R Workflow online course.

You can read more about the penguin dataset here.

Let’s load libraries before we start!

# Load the tidyverse
library(tidyverse)
library(hrbrthemes) # ipsum theme for ggplot2 charts
library(patchwork)  # combine charts together
library(DT)         # interactive tables
library(knitr)      # static table with the kable() function
library(plotly)     # interactive graphs
library(htmltools)  # HTML tags used in the table caption


# Define the custom ggplot theme
my_theme <- function() {
  theme_ipsum() +
    theme(
      plot.title = element_text(color = "#69b3a2", size = 18, face = "bold"),
      axis.text.x = element_text(size = 7),
      axis.text.y = element_text(size = 7)
    )
}

2 Loading data

The dataset has already been loaded and cleaned in the previous step of this pipeline.

Let’s load the clean version, together with a few functions available in functions.R.

# Source functions

source(file="functions/functions.R")

# Read the clean dataset

data <- readRDS(file = "input/clean_data.rds")

Note that bill_length_mm and bill_depth_mm have the following meanings.

Bill measurement explanation

In case you’re wondering what the original dataset looks like, here is a searchable version of it, made using the DT package:

DT::datatable(
  data, 
  options = list(pageLength = 3), 
  filter = "top",
  class = 'cell-border stripe hover compact',
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    'Table 1: ', em('Penguin stuff')
  )
)
data <- data %>%
  mutate(
    bill_depth_mm = as.numeric(bill_depth_mm) # Convert to numeric
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `bill_depth_mm = as.numeric(bill_depth_mm)`.
Caused by warning:
! NAs introduced by coercion
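
The warning above means that at least one raw value in bill_depth_mm could not be parsed as a number and was replaced with NA. A minimal sketch for locating such values before converting is given here. It uses a made-up character vector (the values are illustrative only and are not taken from the penguin data), and the names raw_depth, converted, failed and bad_values are hypothetical.

```r
# Made-up stand-in for the raw bill_depth_mm column (illustrative values only)
raw_depth <- c("18.7", "17.4", "n/a", "19.3", NA)

# Convert while silencing the coercion warning we are investigating
converted <- suppressWarnings(as.numeric(raw_depth))

# TRUE where a value was present in the raw data but became NA on conversion
failed <- !is.na(raw_depth) & is.na(converted)

# The offending raw entries
bad_values <- raw_depth[failed]
print(bad_values)
```

Running the same check on the real column before the mutate() call would show whether the NA comes from a typo, a placeholder string, or some other unparseable entry.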

3 Bill Length and Bill Depth

Now, let’s run some descriptive analysis, including summary statistics and graphs.

What’s striking is the slightly negative relationship between bill length and bill depth: across all penguins, longer bills tend to be shallower. One would expect the opposite.

p <- data %>%
  ggplot(
    aes(x = bill_length_mm, y = bill_depth_mm)
  ) +
  geom_point(color = "#69b3a2") +
  labs(
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    title = "Surprising relationship?"
  ) +
  my_theme()

ggplotly(p)

Relationship between bill length and bill depth. All data points included.


It is also interesting to note that bill length and bill depth differ considerably from one species to another. The average of a variable can be computed as follows:

\[\text{Avg} = \frac{1}{n}\sum_{i=1}^{n} a_{i} = \frac{a_{1} + a_{2} + \cdots + a_{n}}{n}\]
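
As a quick check of the formula, the sketch below computes the average by hand and compares it with R's built-in mean(). The vector a holds made-up values, not penguin measurements. It includes one missing value to show that, with na.rm = TRUE, n counts only the observed entries.

```r
# Made-up bill lengths in mm, including one missing value
a <- c(38.5, 39.1, NA, 40.3)

# n counts only the observed (non-missing) values
n <- sum(!is.na(a))

# Sum of the observed values divided by their count
manual_avg <- sum(a, na.rm = TRUE) / n

# Built-in equivalent, called the same way as in this document's summaries
builtin_avg <- mean(a, na.rm = TRUE)

print(c(manual = manual_avg, builtin = builtin_avg))
```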

Bill length and bill depth averages per species are summarized in the two tables below.

#| layout-ncol: 2

# Calculate the average bill length per species
bill_length_per_specie <- data %>%
  group_by(species) %>%
  summarise(
    average_bill_length = mean(bill_length_mm, na.rm = TRUE)
  )
#bill_length_per_specie
# Display the bill length table
kable(bill_length_per_specie)
species      average_bill_length
Adelie       38.80872
Chinstrap    48.83382
Gentoo       47.50488

# Calculate the average bill depth per species


bill_depth_per_specie <- data %>%
  group_by(species) %>%
  summarise(
    average_bill_depth = mean(bill_depth_mm, na.rm = TRUE)
  )
#bill_depth_per_specie
# Display the bill depth table
kable(bill_depth_per_specie)
species      average_bill_depth
Adelie       18.34228
Chinstrap    18.42059
Gentoo       14.98211

# Extract and round the average bill length for the Adelie species
bill_length_adelie <- bill_length_per_specie %>%
  filter(species == "Adelie") %>%
  pull(average_bill_length) %>%
  round(2)

For instance, the average bill length for the Adelie species is 38.81 mm.

Now, let’s check the relationship between bill depth and bill length for each species separately:

# Use the function in functions.R to draw one scatterplot per species
p1 <- create_scatterplot(data, "Adelie", "#6689c6") + my_theme()
p2 <- create_scatterplot(data, "Chinstrap", "#e85252") + my_theme()
p3 <- create_scatterplot(data, "Gentoo", "#9a6fb0") + my_theme()

# Combine the three charts with patchwork
p1 + p2 + p3

There is actually a positive correlation when split by species.
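
The reversal seen in the charts can also be expressed numerically by comparing the pooled correlation with the within-group correlations. The sketch below is in base R and uses a small made-up data frame (toy, with hypothetical groups "A" and "B"), not the penguin data. It only illustrates the mechanism: each group trends upward, but one group has longer and shallower bills overall, so the pooled trend points downward.

```r
# Toy data (made up): two groups, each with a positive within-group trend
toy <- data.frame(
  species        = rep(c("A", "B"), each = 4),
  bill_length_mm = c(36, 37, 38, 39, 46, 47, 48, 49),
  bill_depth_mm  = c(17, 18, 19, 20, 13, 14, 15, 16)
)

# Correlation over all rows, ignoring the grouping
pooled_cor <- cor(toy$bill_length_mm, toy$bill_depth_mm)

# Correlation computed separately within each group
by_species_cor <- sapply(
  split(toy, toy$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

print(pooled_cor)      # negative
print(by_species_cor)  # positive in both groups
```

Applying the same two computations to the cleaned penguin data (with use = "complete.obs" in cor() to skip missing values) would be one way to put numbers on what the scatterplots show.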
 

A work by Britta Meyer

britta.meyer-1@uni-hamburg.de